Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248) #315
Merged
cocohearts merged 1 commit into openai:main on Mar 23, 2026
Conversation
yes! great job, this is sort of where i went too
bopmite added a commit to bopmite/parameter-golf that referenced this pull request on Mar 21, 2026
saml212 added a commit to saml212/parameter-golf that referenced this pull request on Mar 21, 2026
robinojw pushed a commit to robinojw/parameter-golf that referenced this pull request on Mar 21, 2026
- Add FA3 > FA2 > SDPA attention backend dispatch
- FA2 wrapper uses @torch.compiler.disable + fullgraph=False
- FA3 uses fullgraph=True (compatible with torch.compile)
- Default FP16_KEEP_NAME_PATTERNS empty (quantize everything, matches PR openai#315)
- Add pod_setup.sh with FA3/FA2 install flow
- Add build_fa3_wheel.sh for pre-building FA3 on cheap 1xH100
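The backend preference order described in this commit can be sketched as follows. This is a hypothetical helper, not the PR's actual code; the function name and string labels are illustrative:

```python
def pick_attention_backend(has_fa3: bool, has_fa2: bool) -> str:
    """Prefer FlashAttention-3 (Hopper, torch.compile-friendly with
    fullgraph=True), fall back to FlashAttention-2 (wrapped so the
    compiler skips it), then PyTorch's built-in SDPA."""
    if has_fa3:
        return "fa3"   # compiled with fullgraph=True
    if has_fa2:
        return "fa2"   # @torch.compiler.disable wrapper, fullgraph=False
    return "sdpa"      # always-available fallback
```

The point of the ordering is that FA3 composes with torch.compile while FA2 must be excluded from the compiled graph, so SDPA remains the portable default.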
filipviz added a commit to filipviz/parameter-golf that referenced this pull request on Mar 21, 2026
Rename folder to today's date. Replace train_gpt.py with the new baseline from PR openai#315 (11L XSA4 + EMA + Partial RoPE + Late QAT, 1.1248 BPB). Previous script preserved as previous_train_gpt.py. Update README with PR lineage and new baseline context.
filipviz added a commit to filipviz/parameter-golf that referenced this pull request on Mar 21, 2026
…unner
Port per-head gated attention (12ch, 2*sigmoid) into the PR openai#315 train_gpt.py (11L XSA4 + EMA + Partial RoPE + Late QAT, 1.1248 BPB). Update run script to use the PR openai#315 config for both baseline and experiment.
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request on Mar 21, 2026
152334H reviewed on Mar 21, 2026
records/track_10min_16mb/2026-03-21_11L_XSA4_EMA_PartialRoPE_LateQAT_1.1248/train_gpt.py
felipe-parodi added a commit to felipe-parodi/parameter-golf that referenced this pull request on Mar 21, 2026
- Rebased train_gpt.py on PR openai#315 (1.1248 BPB SOTA)
- Added SGD TTT and causal TTT variant
- Added gradient-guided adaptive quantization (int5/int6/int7)
- Added z-loss regularization
- Updated plan with current landscape and run commands
saml212 added a commit to saml212/parameter-golf that referenced this pull request on Mar 21, 2026
Merged records from all experiment branches into one working branch. Updated CLAUDE.md with current competitive landscape and next priorities. Rewrote idea bank with tiered roadmap for closing the gap to openai#315.
felipe-parodi added a commit to felipe-parodi/parameter-golf that referenced this pull request on Mar 21, 2026
alia-abbas added a commit to alia-abbas/parameter-golf that referenced this pull request on Mar 21, 2026
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request on Mar 21, 2026
torch.compile constant-folds CastedLinear._qat at first trace. Credit: @152334H via PR openai#315.
charmquark1984 added a commit to charmquark1984/parameter-golf that referenced this pull request on Mar 21, 2026
13 techniques tested that did NOT work on the PR openai#315 base:
- Causal TTT (3 variants): neutral on EMA+XSA base
- MTP: +0.028 BPB, throughput penalty kills it
- INT4: 0.06 BPB quant gap wipes out param advantage
- Canon layers: 48% step overhead not compensated
- Memory tokens, gradient-guided quant, cautious WD, L1 regularization, label smoothing, 1M batch, full QAT

4 positive findings:
- EMA > SWA by 0.003 BPB (3-seed verified)
- Weight decay directly controls artifact size
- 786K > 524K batch by 0.004 BPB
- FA3 Hopper: 15-20% more steps at same wallclock

Best verified result: 1.1257 BPB (PR openai#315 reproduction). Includes 12 training logs for verification.
turazashvili added a commit to turazashvili/parameter-golf that referenced this pull request on Mar 22, 2026
Safe config matching PR openai#315 proven techniques:
- 11 layers, MLP 3x (1536), BigramHash 2048
- Muon backend_steps=5, momentum=0.99 (proven by all top PRs)
- XSA on last 4 layers, Partial RoPE 16/64, LN Scale, Late QAT
- EMA decay=0.997 every 4 steps via torch._foreach_lerp_
- CUDA_DEVICE_MAX_CONNECTIONS=1 for multi-GPU overlap
- SmearGate, OrthoInit, int5 MLP/int6 attention, zstd-22
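The fused EMA update named above (decay=0.997 via `torch._foreach_lerp_`) can be sketched as a single in-place lerp over all parameter tensors. This is a minimal sketch, assuming the shadow copies are kept as a flat list; the function name is illustrative:

```python
import torch


@torch.no_grad()
def ema_update(ema_params, live_params, decay=0.997):
    # ema <- decay * ema + (1 - decay) * live, fused across all tensors.
    # lerp_(start=ema, end=live, weight=1-decay) gives exactly this blend,
    # and the _foreach_ variant avoids a Python loop per tensor.
    torch._foreach_lerp_(list(ema_params), list(live_params), 1.0 - decay)
```

Per the commit message, the call runs once every 4 optimizer steps rather than every step, trading a little EMA fidelity for less overhead.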
EthanYangTW added a commit to EthanYangTW/parameter-golf that referenced this pull request on Mar 22, 2026
…le, EMA, Late QAT, TTT
Major rewrite targeting top-5 leaderboard:
- 11 layers (from 10), BigramHash reduced to 10240 to fit 16MB
- XSA (Exclusive Self-Attention) on last 4 layers
- Partial RoPE: 16/64 head dims get position encoding
- LN Scale: 1/sqrt(layer+1) dampening on deeper layers
- EMA (decay=0.997) replaces SWA
- Late QAT: STE int6 enabled only in final 4% of training
- TTT: 25-epoch SGD on val data post-quantization
- FA3 auto-detection with SDPA fallback
- Reverted SwiGLU back to relu² (confirmed worse by openai#340, openai#344)
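The "STE int6" idea referenced in several commits here can be sketched as symmetric fake quantization with a straight-through estimator: the forward pass rounds weights onto the 6-bit grid, the backward pass treats rounding as identity so gradients flow unchanged. A minimal sketch, assuming a single per-tensor scale (per-channel scaling and layer selection are details this sketch omits):

```python
import torch


def ste_int6_fake_quant(w: torch.Tensor) -> torch.Tensor:
    # Symmetric int6 grid: integer levels in [-32, 31].
    qmax = 2 ** (6 - 1) - 1                        # 31
    scale = w.abs().max().clamp(min=1e-8) / qmax   # per-tensor scale
    q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    # Straight-through estimator: forward value of q, gradient of w.
    return w + (q - w).detach()
```

Enabling this only late in training (the "Late QAT" schedule) is meant to let the model adapt to quantization noise without paying the accuracy cost for the whole run.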
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request on Mar 23, 2026
… 3 seeds)
AdamW TTT with cosine lr decay over 30 epochs and per-layer lr groups (3x for MLP output projections, 0.5x for input projections). 34 TTT configurations tested. FINDINGS.md documents 31 experiments, including negative results on codebook quantization, symmetry-transport, layer dropping, focal loss, and KL divergence TTT. Builds on PRs openai#162, openai#180, openai#77, openai#398, openai#442, openai#417, openai#315.
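The per-layer lr grouping plus cosine decay described in this commit can be sketched with standard PyTorch optimizer param groups. The module name patterns (`c_proj`, `c_fc`) are placeholders, not the repo's actual names:

```python
import torch


def make_ttt_optimizer(named_params, base_lr=1e-4, epochs=30):
    # Sort parameters into lr groups: 3x for MLP output projections,
    # 0.5x for input projections, 1x for everything else.
    out_proj, in_proj, rest = [], [], []
    for name, p in named_params:
        if "c_proj" in name:
            out_proj.append(p)
        elif "c_fc" in name:
            in_proj.append(p)
        else:
            rest.append(p)
    groups = [g for g in (
        {"params": out_proj, "lr": 3.0 * base_lr},
        {"params": in_proj, "lr": 0.5 * base_lr},
        {"params": rest, "lr": base_lr},
    ) if g["params"]]  # drop empty groups
    opt = torch.optim.AdamW(groups)
    # Cosine lr decay over the 30 TTT epochs.
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    return opt, sched
```

After each TTT epoch, `sched.step()` would anneal every group's lr along the cosine curve while preserving the 3x / 0.5x / 1x ratios.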
alia-abbas added a commit to alia-abbas/parameter-golf that referenced this pull request on Mar 23, 2026
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request on Mar 23, 2026
ROPE_DIMS=16: apply rotary to 25% of head dims, rest position-free
LN_SCALE=1: scale RMSNorm output by 1/sqrt(layer+1)
Both env-var gated, default off — existing runs unaffected.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nvemuri4649 pushed a commit to thanushpatlolla/parameter-golf that referenced this pull request on Mar 27, 2026
…e-lateqat-1.1248
Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248)
Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248)
val_bpb: 1.1248 (sliding window, stride=64) | 15.6 MB | 8xH100 SXM, 600s
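For reference, bits-per-byte normalizes the summed token negative log-likelihood (in nats) by the raw byte count of the text; the stride-64 sliding window means each scored token sees long left context. A minimal conversion helper (the function name is illustrative):

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    # bpb = (sum of per-token NLL in nats) / (bytes * ln 2),
    # i.e. convert nats to bits, then divide by the byte count.
    return total_nll_nats / (total_bytes * math.log(2))
```

Normalizing by bytes rather than tokens keeps the metric comparable across tokenizers.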
Progress from prior submissions
Two new techniques on top of PR #287's 11-layer stack.
Key additions over PR #287
- Partial RoPE: rotary position encoding applied to 16 of 64 head dims; the remaining dims are position-free.
- LN Scale: RMSNorm output scaled by 1/sqrt(layer+1) to dampen deeper layers.
Everything else from PR #287 carries forward: 11 layers, XSA on last 4 layers, EMA (0.997), OrthoInit + muP, 3x MLP, int6 mixed quant + zstd-22, WD=0.04, SmearGate, BigramHash(2048), FA3, seq 2048, tuned Muon.
Results
Reproducibility (3 seeds)
Mean: 1.1250 | Variance: 0.0005 | Submitted: seed 2025
Run command
Note on Late QAT
The submitted code includes a Late QAT flag (`LATE_QAT=1`) intended to enable STE int6 fake-quantization in the final 4% of training. Post-submission analysis (credit: @152334H) revealed that `torch.compile` constant-folds the `CastedLinear._qat_enabled` class attribute at first trace, so the STE branch is dead-code-eliminated and never activates during training. Late QAT therefore had no effect on the result; the score is driven entirely by Partial RoPE and LN Scale.
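A minimal sketch of the pitfall (an illustrative reconstruction, not the submitted code): a plain Python class attribute read inside `forward()` is specialized at `torch.compile`'s first trace, so flipping it later in training may never reach the compiled graph. The `fake_quant_int6` helper below is hypothetical:

```python
import torch


def fake_quant_int6(w: torch.Tensor) -> torch.Tensor:
    # Illustrative symmetric int6 rounding.
    scale = w.abs().max().clamp(min=1e-8) / 31
    return (w / scale).round().clamp(-32, 31) * scale


class CastedLinear(torch.nn.Linear):
    # Class-level Python bool: torch.compile specializes on its value at
    # the first trace, so if tracing happens while it is still False, the
    # quantized branch below is dead-code-eliminated from the graph.
    _qat_enabled = False

    def forward(self, x):
        w = fake_quant_int6(self.weight) if self._qat_enabled else self.weight
        return torch.nn.functional.linear(x, w, self.bias)
```

In eager mode the flag flip works as expected; it is only under compilation that the branch can be baked in. Plumbing the flag through as a tensor input, or forcing a recompile when it changes, are the usual ways to avoid the specialization.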